πŸ“Ί OK, not really!

This is actually about tidyr::pivot_*() functions, which you can learn all about in the new tidyr version 1.0.0 Pivoting vignette!

But, I will be using some nifty TV-related data. So, I’m sticking with the name.

library(tidyverse)

πŸ” Bob’s Burgers: A Belcher Family survey

sheet <- googlesheets::gs_title("bobs_burgers_survey_results")

bobs_ws <- googlesheets::gs_ws_ls(sheet)
raw_dat <- sheet %>% 
  googlesheets::gs_read(ws = glue::glue("{bobs_ws}"))
## Accessing worksheet titled 'Form Responses 1'.
## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Members of the Belcher family with whom I identify (select all that apply)` = col_character()
## )
belcher_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
  dplyr::rename("response" = `Members of the Belcher family with whom I identify (select all that apply)`) %>%
  dplyr::select(-Timestamp)

Because I used a Google Form to collect this data, I don’t have to worry about order of names, since they come out the same every time.

agg_results <- belcher_results %>%
  dplyr::group_by(response) %>%
  dplyr::summarise(total = n()) %>%
  dplyr::arrange(desc(total))

agg_results
## # A tibble: 30 x 2
##    response                       total
##    <chr>                          <int>
##  1 Bob                               49
##  2 Tina                              29
##  3 Louise                            23
##  4 Bob, Tina                         22
##  5 Bob, Louise                       17
##  6 Bob, Tina, Louise                 17
##  7 Tina, Louise                      12
##  8 Bob, Linda, Tina, Gene, Louise    11
##  9 Bob, Tina, Gene                   10
## 10 Bob, Gene                          8
## # … with 20 more rows

tidyr::separate_rows()

Let’s make them long with tidyr::separate_rows(). I’m also adding a numeric variable, identify (short for β€œcharacter(s) with whom I identify”), and ensuring that respondent IDs don’t get erroneously treated as numeric, by converting them to characters.

belcher_results <- belcher_results %>%
  tidyr::separate_rows(response) %>%
  dplyr::mutate(identify = 1,
                resp_id = as.character(resp_id))

head(belcher_results)
## # A tibble: 6 x 3
##   resp_id response identify
##   <chr>   <chr>       <dbl>
## 1 1       Linda           1
## 2 1       Tina            1
## 3 1       Louise          1
## 4 2       Bob             1
## 5 2       Gene            1
## 6 3       Bob             1

tidyr::pivot_wider()

Now we’ll take one of the β€œnew” tidyr verbs for a spin, pivot_wider(). Rather than fill things out with a bunch of NAs, we’ll prepare our data to use with the UpSetR package by turning it into binaries, and ditch the respondent ID in the end.

binary_tib <- belcher_results %>%
  tidyr::pivot_wider(
    names_from = response,
    values_from = identify,
    values_fill = list(identify = 0)
  ) %>%
  dplyr::select(-resp_id)

head(binary_tib)
## # A tibble: 6 x 5
##   Linda  Tina Louise   Bob  Gene
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     1     1      1     0     0
## 2     0     0      0     1     1
## 3     0     1      0     1     1
## 4     1     1      1     1     1
## 5     0     0      0     1     0
## 6     0     0      1     0     0

UpSetR::upset()

I highly recommend Paul Campbell’s code-through using UpSetR, which gave me (among other things) the pro tip that upset() does not like tibbles (hence the as.data.frame() at the end).

binary_df <- as.data.frame(binary_tib)
UpSetR::upset(binary_df, nsets = 5, order.by = "freq")

Bonus: Venn Diagram

In essence, our desired output above (the UpSet plot), dictated the format of our data. If we wanted to use a Venn Diagram (for example, using the VennDiagram) package, we’d want our data in yet another format.

We’ll use out long data frame from before, belcher_results. What we want is a set of respondents who identified with each character. For example, if I wanted just the respondents who chose Bob, I would do the following:

bob <- belcher_results %>%
  filter(response == "Bob") %>%
  pull(resp_id)

I’ll make a little helper function, and do the same for the rest of the family. (Yes, this could be refactored to be much more efficient, but the names of the members of the Belcher Family roll of my fingertips easily enough).

# little brittle helper
make_set <- function(x) {
  belcher_results %>%
    filter(response == x) %>%
    pull(resp_id)
}

linda <- make_set("Linda")

tina <- make_set("Tina")

gene <- make_set("Gene")

louise <- make_set("Louise")

Now I’ll follow this handy tutorial from the R Graph Gallery, and turn this into a basic Venn Diagram

library(VennDiagram)
## Loading required package: grid
## Loading required package: futile.logger
venn.diagram(
  x = list(bob, linda, tina, gene, louise),
  category.names = c("Bob", "Linda", "Tina", "Gene", "Louise"),
  filename = "belcher_venn.png"
)
## [1] 1

πŸ“½ I have seen this movie…

Let’s take a look at another dataset I collected with a quick survey, this one asking people whether they had or had not seen a given movie.

sheet <- googlesheets::gs_title("seen_this_movie")
## Sheet successfully identified: "seen_this_movie"
movie_ws <- googlesheets::gs_ws_ls(sheet)

raw_dat <- sheet %>% 
  googlesheets::gs_read(ws = glue::glue("{movie_ws}"))
## Accessing worksheet titled 'Form Responses 1'.
## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Airplane!` = col_logical(),
##   Anchorman = col_logical(),
##   `Animal House` = col_logical(),
##   `The Big Lebowski` = col_logical(),
##   `The Blues Brothers` = col_logical(),
##   Borat = col_logical(),
##   Bridesmaids = col_logical(),
##   `Office Space` = col_logical(),
##   `Old School` = col_logical(),
##   `This Is Spinal Tap` = col_logical(),
##   `Tommy Boy` = col_logical(),
##   Superbad = col_logical(),
##   `My age is…` = col_double()
## )

First I’ll use tibble::rowid_to_column() to keep track of which user said what (and, again, make sure those are stored as characters rather than numeric, so I don’t accidentally analyze it the wrong way).

I still don’t care about the timestamp, so I’ll get everything but that using dplyr::select(-Timestamp).

The variable names here have a whole bunch of annoying characters (e.g.Β spaces, exclamation points). So, I’m going to use Sam Firke’s janitor package β€” specifically the janitor::clean_names() function β€” to take care of some of the grunt work for me there.

I’m also going to rename my_age_is to age, just because it’s annoying.

movie_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
  dplyr::mutate(resp_id = as.character(resp_id)) %>%
  dplyr::select(-Timestamp) %>%
  janitor::clean_names() %>%
  dplyr::rename("age" = my_age_is)

head(movie_results)
## # A tibble: 6 x 14
##   resp_id airplane anchorman animal_house the_big_lebowski the_blues_broth…
##   <chr>   <lgl>    <lgl>     <lgl>        <lgl>            <lgl>           
## 1 1       TRUE     TRUE      TRUE         FALSE            TRUE            
## 2 2       FALSE    TRUE      TRUE         TRUE             FALSE           
## 3 3       FALSE    FALSE     FALSE        TRUE             FALSE           
## 4 4       FALSE    FALSE     FALSE        FALSE            FALSE           
## 5 5       FALSE    TRUE      FALSE        FALSE            FALSE           
## 6 6       FALSE    TRUE      FALSE        TRUE             TRUE            
## # … with 8 more variables: borat <lgl>, bridesmaids <lgl>,
## #   office_space <lgl>, old_school <lgl>, this_is_spinal_tap <lgl>,
## #   tommy_boy <lgl>, superbad <lgl>, age <dbl>

Since this dataset is wide, let’s also take a gander at it using glimpse():

glimpse(movie_results)
## Observations: 146
## Variables: 14
## $ resp_id            <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ airplane           <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ anchorman          <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
## $ animal_house       <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ the_big_lebowski   <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, …
## $ the_blues_brothers <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ borat              <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE…
## $ bridesmaids        <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ office_space       <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ old_school         <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ this_is_spinal_tap <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALS…
## $ tommy_boy          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ superbad           <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, …
## $ age                <dbl> 48, 31, 30, 20, 29, 28, 24, 42, 33, 34, 33, 3…

tidyr::pivot_longer()

Since I only want to elongate the movies, and all of those columns are logical, I’ll first select which columns I want by using select_if() and is.logical() as the predicate function.

logicols <- select_if(movie_results, is.logical) %>%
  colnames()

movie_long <- movie_results %>%
  pivot_longer(
    cols = one_of(logicols),
    names_to = "movie",
    values_to = "seen"
  )

head(movie_long)
## # A tibble: 6 x 4
##   resp_id   age movie              seen 
##   <chr>   <dbl> <chr>              <lgl>
## 1 1          48 airplane           TRUE 
## 2 1          48 anchorman          TRUE 
## 3 1          48 animal_house       TRUE 
## 4 1          48 the_big_lebowski   FALSE
## 5 1          48 the_blues_brothers TRUE 
## 6 1          48 borat              FALSE

Aside: A nice little trick for recoding TRUE/FALSE as 0 and 1, just use as.numeric().

movie_long %>%
  mutate(seen_num = as.numeric(seen))
## # A tibble: 1,752 x 5
##    resp_id   age movie              seen  seen_num
##    <chr>   <dbl> <chr>              <lgl>    <dbl>
##  1 1          48 airplane           TRUE         1
##  2 1          48 anchorman          TRUE         1
##  3 1          48 animal_house       TRUE         1
##  4 1          48 the_big_lebowski   FALSE        0
##  5 1          48 the_blues_brothers TRUE         1
##  6 1          48 borat              FALSE        0
##  7 1          48 bridesmaids        TRUE         1
##  8 1          48 office_space       FALSE        0
##  9 1          48 old_school         FALSE        0
## 10 1          48 this_is_spinal_tap FALSE        0
## # … with 1,742 more rows

Let’s briefly pretend we’re looking at just three movies: Bridesmaids, Anchorman, and Airplane.

three_movies <- c("airplane", "anchorman", "bridesmaids")
movie_long %>%
  filter(movie %in% three_movies)
## # A tibble: 438 x 4
##    resp_id   age movie       seen 
##    <chr>   <dbl> <chr>       <lgl>
##  1 1          48 airplane    TRUE 
##  2 1          48 anchorman   TRUE 
##  3 1          48 bridesmaids TRUE 
##  4 2          31 airplane    FALSE
##  5 2          31 anchorman   TRUE 
##  6 2          31 bridesmaids TRUE 
##  7 3          30 airplane    FALSE
##  8 3          30 anchorman   FALSE
##  9 3          30 bridesmaids FALSE
## 10 4          20 airplane    FALSE
## # … with 428 more rows

For the wide version, we’ll add the respondent id and age to the list of variables we want to look at.

vars_wanted <- c("resp_id", "age", three_movies)

movie_results %>%
  select(vars_wanted)
## # A tibble: 146 x 5
##    resp_id   age airplane anchorman bridesmaids
##    <chr>   <dbl> <lgl>    <lgl>     <lgl>      
##  1 1          48 TRUE     TRUE      TRUE       
##  2 2          31 FALSE    TRUE      TRUE       
##  3 3          30 FALSE    FALSE     FALSE      
##  4 4          20 FALSE    FALSE     FALSE      
##  5 5          29 FALSE    TRUE      TRUE       
##  6 6          28 FALSE    TRUE      FALSE      
##  7 7          24 TRUE     TRUE      FALSE      
##  8 8          42 TRUE     FALSE     FALSE      
##  9 9          33 TRUE     TRUE      TRUE       
## 10 10         34 TRUE     TRUE      TRUE       
## # … with 136 more rows